Watch Out For Modulo And Hashes
A colleague and I were working on some data analysis tools today, and they encountered a puzzling error when processing a large chunk of data. Due to the volume of unique users that they were analyzing telemetry for, they decided to use sampling by hashing the user IDs and then taking a slice of the group they wanted to investigate.
Because they were using Kusto, they could rely on hash()
- a function that returns a hash based on the input value (it uses the xxhash algorithm behind the scenes). This would, in a way, normalize the user IDs under the same structure, that can then be sequenced.
There is one example in the aforementioned function document, that shows how to use the hash()
function with a little bit of statistical sugar - modulo (written as %). Recall how I mentioned that they needed to take a slice of the overall audience? Hashing the IDs and then applying modulo allows doing just that. They were puzzled because the sample used the following code snippet:
//The following example uses the hash function to
//run a query on 10% of the data
| where hash(StartTime, 10) == 0
When they tried to adjust the modulo parameter to 20, they thought they would get 20% of the users, respectively, however the output didn’t quite match the expectations - the number was much smaller. So what is going on here? Let’s explore how the hashing approach works to select a reliable sample of the audience.
First, you start with a set of IDs:
UserID |
---|
38235e66-021b-4092-b5e0-72687ba6dcce |
090d7552-6843-404c-85ac-30e9d93f265f |
83d4f1b7-061b-439b-83fa-6bf4e6105811 |
91e7af98-783d-4ecb-a416-734997b0ce27 |
5de7e32f-5de1-4307-adc0-fca54c33ead8 |
97aa5ec0-5011-4446-a3a2-eb4db541f127 |
04ca250b-2530-49ba-bcf1-c28b81f834ed |
2938e539-eb42-4c5c-816d-3ef6e6d052b2 |
f8f39fe5-3bba-4eae-a7a8-fd6e589eb5a0 |
c015bc77-dafd-495a-a15e-2692c0e35731 |
These values are subsequently hashed with the algorithm of choice. For the purposes of this article, I will be using xxhash as well (don’t focus too much on the choice of the hashing algorithm right now, really). Running the set above through a hashing function can be done using xxhash in Python:
import xxhash
x = xxhash.xxh64()
x.update(b'38235e66-021b-4092-b5e0-72687ba6dcce')
digest = x.hexdigest()
The output should be similar to this:
UserID |
---|
367b50760441849e |
2d0c85895cf04f08 |
32da77b165ab8d0e |
0e0c5bbc8559f1fc |
c22f1c8a59ec2e4c |
0a5dfc841d9d9586 |
b2bee630cc26677f |
ba69d71a5b3d66c5 |
63e75af38354423b |
77f3d9157770388a |
See how all values are of uniform length. Next, we need to apply the modulo function. You might wonder - how would I do that to a hash? First, we need to convert the hexadecimal value to an integer. You can do that in Python with intdigest()
, or in bash:
echo $((0x367b50760441849e))
Running this on our hashes, we get:
UserID | DecimalValue |
---|---|
367b50760441849e |
3925819967991284894 |
2d0c85895cf04f08 |
3246116256443551496 |
32da77b165ab8d0e |
3664372850617978126 |
0e0c5bbc8559f1fc |
1012284881500762620 |
c22f1c8a59ec2e4c |
13992433947803135564 |
0a5dfc841d9d9586 |
747030757576119686 |
b2bee630cc26677f |
12879985081584084863 |
ba69d71a5b3d66c5 |
13432503871809087173 |
63e75af38354423b |
7198822531301917243 |
77f3d9157770388a |
8643490796075497610 |
Now, when we specify the value for the modulo operation, we do not define the percentage of the audience we need, but rather the number of uniform buckets we want to have where we’ll classify user IDs. By now, this probably rings a bell if you’re familiar with hash tables. An expression like hash(x) % 5
would mean that we want to get five (5) buckets of hashes. Going back to our earlier Kusto example, where hash(StartTime, 10)
meant that there are 10 buckets, and each represents 10% of the overall audience. If we split this in 20 buckets, that would mean that we’d be looking at 5% of the audience in each bucket.
Taking the table above, if we’d want to get 20% of the hashes, we would write a function that would check if hash(x) % 5 == 0
, to get the chunk of the audience representative of the segment we want. Running modulo on DecimalValue
would give us this:
UserID | DecimalValue | ModuloOutput |
---|---|---|
367b50760441849e |
3925819967991284894 |
4 |
2d0c85895cf04f08 |
3246116256443551496 |
1 |
32da77b165ab8d0e |
3664372850617978126 |
1 |
0e0c5bbc8559f1fc |
1012284881500762620 |
0 |
c22f1c8a59ec2e4c |
13992433947803135564 |
4 |
0a5dfc841d9d9586 |
747030757576119686 |
1 |
b2bee630cc26677f |
12879985081584084863 |
3 |
ba69d71a5b3d66c5 |
13432503871809087173 |
3 |
63e75af38354423b |
7198822531301917243 |
3 |
77f3d9157770388a |
8643490796075497610 |
0 |
And there we have it - two values that are of interest are 0e0c5bbc8559f1fc
and 77f3d9157770388a
. And it’s representative of 20%. Rememember - doing modulo-based sampling means you are splitting values by bucket, not by percent of records.